Abstract

A problem posed by Freund is how to efficiently 
track a small pool of experts out of a much larger 
set. This problem was solved when Bousquet and 
Warmuth introduced their mixing past posteriors 
(MPP) algorithm in 2001. 

In Freund's problem the experts would normally 
be considered black boxes. However, in this paper 
we re-examine Freund's problem in case the ex- 
perts have internal structure that enables them to 
learn. In this case the problem has two possible 
interpretations: should the experts learn from all 
data or only from the subsequence on which they 
are being tracked? The MPP algorithm solves the 
first case. Our contribution is to generalise MPP to 
address the second option. The results we obtain 
apply to any expert structure that can be formal- 
ised using (expert) hidden Markov models. Curi- 
ously enough, for our interpretation there are two 
natural reference schemes: freezing and sleeping. 
For each scheme, we provide an efficient predic- 
tion strategy and prove the relevant loss bound. 

1 Introduction 

Freund's problem arises in the context of prediction with expert
advice [2]. In this setting a sequence of outcomes needs
to be predicted, one outcome at a time. Thus, prediction 
proceeds in rounds: in each round we first consult a set of 
experts, who give us their predictions. Then we make our 
own prediction and incur some loss based on the discrep- 
ancy between this prediction and the actual outcome. The 
goal is to minimise the difference between our cumulative 
loss and some reference scheme. For this reference there are 
several options; we may, for example, compare ourselves to 
the cumulative loss of the best expert in hindsight. A more 
ambitious reference scheme was proposed by Yoav Freund 
in 2000. 
Freund's Problem Freund asked for an efficient prediction
strategy that suffers low additional loss compared to the
following reference scheme: 

(a) Partition the data into several subsequences. 

(b) Select an expert for each subsequence. 

(c) Sum the loss of the selected experts on their subsequences. 

In 2001, Freund's problem was addressed by Bousquet and 
Warmuth, who developed the efficient mixing past posteri- 
ors (MPP) algorithm [1]. MPP's loss is bounded by the 
loss of Freund's scheme plus some overhead that depends 
on the number of bits required to encode the partition of the 
data, and it has found successful application in [5]. Problem
solved. Or is it? 

The Loss of an Expert on a Subsequence In our view 
Freund's problem has two possible interpretations, which dif- 
fer most clearly for learning experts. Namely, to measure the 
predictive performance of an expert on a subsequence, do we 
show her the data outside her subsequence or not? An expert 
that sees all outcomes will track the global properties of the 
data. This is (implicitly) the case for mixing past posteriors. 
But an expert that only observes the subsequence that she 
has to predict might see and thus exploit its local structure, 
resulting in decreased loss. The more the characteristics of 
the subsequences differ, the greater the gain. Let us illustrate 
this by an example. 

Ambiguity Example The data consist of a block of ones, 
followed by a block of zeros, again followed by a block of 
ones. In the reference scheme step (a), we split the data
into two subsequences, one consisting of only ones, the other 
consisting of only zeros. Our expert predicts the probability 
of a one using Laplace's rule of succession, i.e. she learns 
the frequency of ones in the data that she observes [2]. Note
that one learning expert suffices, as we can select her for both 
subsequences. 

First we consider the expert's predictions when she ob- 
serves all data. During the first block, our expert will in- 
crease her probability of a one from 1/2 to nearly one. Then 
during the second block it will go down to 1/2 again. During
the third block it will increase from 1/2 up, but slower. Thus,
for block two the expert is extremely bad, while for block
three she is at best mediocre. (See Figure 1.)

Compare this to the expert's predictions on the subsequences.
During the subsequence of ones (first and third block), our
expert will increase her probability of a one from 1/2 to
almost one, while during the subsequence of zeros she will
decrease her probability from 1/2 to nearly zero. Thus, the
expert is much better on the subsequences in isolation.

Figure 1 Estimated probability of a one

Figure 2 Freezing (consecutive) and Sleeping (timing
preserved). Both use experts A,B,C on subsequences 1,2,3.

This shows that the predictive performance of a learn- 
ing expert on a subsequence in isolation can be dramatically 
higher than that on the same sequence in the context of all 
data. This behaviour is typical: on all data a learning expert 
will learn the average, global pattern, while on a well-chosen 
subsequence she can zoom in on the local structure. 
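
The ambiguity example can be checked numerically. The following is a minimal sketch, not part of the original analysis: the block length of 20 and the name `laplace_log_loss` are assumptions chosen for illustration. It compares the cumulative log loss of the Laplace expert when she observes all data with her loss when she is run separately on the subsequence of ones and the subsequence of zeros.

```python
import math

def laplace_log_loss(bits):
    """Cumulative log loss (in nats) of Laplace's rule of succession on `bits`."""
    ones = 0
    loss = 0.0
    for n, bit in enumerate(bits):
        p_one = (ones + 1) / (n + 2)  # Laplace estimate after n observations
        loss -= math.log(p_one if bit == 1 else 1 - p_one)
        ones += bit
    return loss

block = 20  # hypothetical block length
data = [1] * block + [0] * block + [1] * block

# Expert observes all data (the interpretation implicit in MPP).
loss_all_data = laplace_log_loss(data)

# Expert observes each subsequence in isolation:
# all ones (blocks one and three) and all zeros (block two).
ones_cell = [x for x in data if x == 1]
zeros_cell = [x for x in data if x == 0]
loss_isolated = laplace_log_loss(ones_cell) + laplace_log_loss(zeros_cell)

print(f"all data: {loss_all_data:.2f} nats, isolated: {loss_isolated:.2f} nats")
```

On a constant sequence of length n the Laplace expert's loss telescopes to log(n + 1), so the isolated loss here is log 41 + log 21, whereas the loss on all data is dominated by the badly predicted second block.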

Structured Experts In this paper, we solve Freund's prob- 
lem under the interpretation that experts only observe the 
subsequence on which they are evaluated. Of course, for 
arbitrary experts, this is impossible. For in the setting of 
prediction with experts, the expert predictions that we re- 
ceive each round are always in the context of all data. We 
have no access to the experts' predictions in the context of 
any subsequence, which may differ drastically from those on 
the whole data. 

Often however, experts have internal structure. For example,
in [11, 8, 15, 16] adaptive prediction strategies (i.e.,
learning experts) are explicitly constructed from basic experts.
To represent such structured experts, we use a general
framework called expert hidden Markov models (EHMMs),
which was introduced in [9]. EHMMs are hidden Markov models
in which the production probabilities are determined by
expert advice. A structured expert in EHMM form provides
sufficient information about its predictions on any isolated
subsequence.

Many strategies for prediction with expert advice (i.e.
learning experts) can be rendered as EHMMs, for example
all adaptive strategies in the papers above (see [9]). But
there are also strategies that cannot be brought into EHMM
form, e.g. follow the perturbed leader [6] and variable
share [8].

Our approach may also be of interest to machine learning
with regular hidden Markov models (HMMs) [12]. Although
existing approaches to shift between multiple HMMs [3, 4,
10] usually focus on change-point detection, prediction seems
a highly related issue.

Sleeping or Freezing We evaluate the performance of learning
experts on subsequences in isolation. But now another
choice presents itself (see Figure 2). Should we present the
subsequence to the expert consecutively (we view this as 
freezing the expert's state on other data)? Or should we retain 
the original timing of the selected samples and keep the inter- 
mediate samples unobserved (then the expert is sleeping for 
other data)? To illustrate the difference, consider an expert 
that is able to predict the television images of our favourite 
show. We want to freeze her during commercial breaks, so 
that she continues predicting the show where it left off. We
want to put her to sleep when we zap to another channel, so 
that after zapping back, she will predict the show as it has 
advanced. Thus freezing vs sleeping is a modelling decision, 
that should be made on a case-by-case basis. We cover both 
scenarios. 

1.1 Overview 

After preliminaries we start by reviewing the main existing
loss bound for mixing past posteriors in §3. Then, in §4 we
introduce EHMMs as a way to represent structured experts.

The next section, §5, contains our results for Freund's
problem when structured experts are evaluated on isolated
subsequences. We formalise sleeping and freezing as two
different ways of presenting a subsequence of the data to an
EHMM, and present the evolving past posteriors (EPP)
algorithm that takes an EHMM as input. The EPP algorithm
has two variants, which generalise the mixing past
posteriors algorithm in different ways: EPP-Sleeping for
sleeping and EPP-Freezing for freezing. The relation between
EPP and other existing prediction strategies is shown in
Figure 3. There A → B means that by carefully choosing
prediction strategy A's parameters it reduces to strategy B.

In order to understand EPP, we verify that it produces 
the same predictions for any two EHMMs that are equival- 
ent in an appropriate sense, and analyse its running time. We 
then proceed to show our main result, which is that the losses 
of EPP-Freezing and EPP-Sleeping are bounded by the 
loss of Freund's scheme plus a complexity penalty that de- 
pends on the number of bits required to encode the reference 
partition in the same way as for mixing past posteriors. In 
fact, our bounds (slightly) improve the known loss bound 
for mixing past posteriors. Thus we solve Freund's problem 
with learning experts presented as EHMMs, both for freez- 
ing and for sleeping. 

We first derive our results only for logarithmic loss. This 
allows us to use familiar concepts and results from probab- 
ility theory and refer to the interpretation of log loss as a 
codelength [2]. In §6 we conclude by proving that any
algorithm that satisfies certain weak conditions, in particular
EPP, directly generalises to an algorithm for arbitrary mix- 
able losses with the appropriate loss bounds. 

2 Preliminaries 

Prediction With Expert Advice Each round $t$, we first
receive advice from each expert $e \in \mathcal{E}$ in the form of an action
$a_t^e \in \mathcal{A}$. Then we distill our own action $a_t^{\mathrm{alg}} \in \mathcal{A}$ from
the expert advice. Finally, the actual outcome $x_t \in \mathcal{X}$ is
observed, and everybody suffers loss as specified by a fixed loss
function $\ell : \mathcal{A} \times \mathcal{X} \to [0, \infty]$. Thus, the performance of a
sequence of actions $a_1 \cdots a_T$ upon data $x_1 \cdots x_T$ is measured
by the cumulative loss $\sum_{t=1}^{T} \ell(a_t, x_t)$.

Figure 3 Generalisation relation among prediction strategies
(nodes: EPP-Freezing, EPP-Sleeping, Forward Algorithm,
Mixing Past Posteriors, Bayes)

Log Loss For log loss the actions $\mathcal{A}$ are probability
distributions on $\mathcal{X}$ and $\ell(p, x) = -\log p(x)$, where log denotes
the natural logarithm. It is important to notice that minimizing
log loss is equivalent to maximizing the predicted probability
of outcome $x$. We write $p_t^e$ for the prediction of expert
$e$ at time $t$ and denote these predictions jointly by $p_t^{\mathcal{E}}$.

Subsequences For $m \le n$, we abbreviate $\{m, \ldots, n\}$ to
$m{:}n$. For completeness, we set $m{:}n = \emptyset$ for $m > n$. For any
sequence $y_1, y_2, \ldots$ and any set of indices $C = \{i_1, i_2, \ldots\}$
we write $y_C$ for the subsequence $(y_i)_{i \in C}$. For example, $x_C =
(x_i)_{i \in C}$ and $p^{\mathcal{E}}_{1:T} = p^{\mathcal{E}}_1, \ldots, p^{\mathcal{E}}_T$. If members of a family
$\mathcal{C} = \{C_1, C_2, \ldots\}$ are pairwise disjoint and together cover
$1{:}T$ ($\bigcup \mathcal{C} = 1{:}T$), then we call $\mathcal{C}$ a partition of $1{:}T$, and its
members cells.

3 Mixing Past Posteriors 

Mixing past posteriors (MPP) is a strategy for prediction 
with expert advice. It operates by maintaining a table of 
so-called posterior distributions on the set of experts. Each 
round, we first compute the predictive distribution on ex- 
perts by mixing all the posteriors in the table. Then the next 
outcome is predicted by mixing the expert predictions ac- 
cording to this distribution. Finally, the next outcome is ob- 
served. The predictive distribution on experts is conditioned 
on this outcome, and the posterior distribution thus obtained 
is appended to the table of posteriors. Note the recursive 
construction of the distributions in the table; they are not 
Bayesian posteriors, but conditioned mixtures of all earlier 
distributions from that same table. 
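
The table-based recursion described above can be sketched in a few lines. This is an illustrative rendering for log loss and a finite set of experts, not the authors' implementation; the function name `mpp_log_loss`, the list-of-dicts encoding of expert predictions, and the `latest` mixing scheme are assumptions made for the example.

```python
import math

def mpp_log_loss(expert_preds, outcomes, prior, beta):
    """Cumulative log loss of mixing past posteriors (sketch).

    expert_preds[t][e] is a dict mapping outcomes to probabilities for
    round t+1, prior[e] is the initial weight of expert e, and beta(t)
    returns mixing weights over the t table entries available in round t
    (rounds are 1-based; table entry 0 is the prior).
    Assumes every observed outcome receives positive probability.
    """
    table = [list(prior)]
    total = 0.0
    for t, x in enumerate(outcomes, start=1):
        weights = beta(t)
        # Predictive distribution on experts: mixture of all table entries.
        v = [sum(w * post[e] for w, post in zip(weights, table))
             for e in range(len(prior))]
        # Predict the outcome by mixing the expert predictions.
        likes = [expert_preds[t - 1][e][x] for e in range(len(prior))]
        p_x = sum(ve * le for ve, le in zip(v, likes))
        total -= math.log(p_x)
        # Condition on the outcome and append the result to the table.
        table.append([ve * le / p_x for ve, le in zip(v, likes)])
    return total

# Two constant experts; mixing only the most recent entry gives plain Bayes.
experts = [{1: 0.9, 0: 0.1}, {1: 0.1, 0: 0.9}]
outcomes = [1, 1, 0]
preds = [experts] * len(outcomes)
latest = lambda t: [0.0] * (t - 1) + [1.0]
bayes_loss = mpp_log_loss(preds, outcomes, [0.5, 0.5], latest)
```

With all mixing weight on the most recent table entry the recursion collapses to ordinary Bayesian updating, which gives a convenient sanity check: the loss equals the negative logarithm of the prior-weighted average of the experts' likelihoods.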

We will not formally introduce MPP here, but recover it
as a special case of both the freezing and sleeping algorithms
in §5.4. Here we state the classical loss bound [1, Theorem 7],
introducing our notation along the way. This loss
bound relates the loss of MPP to Freund's scheme, where we
choose a partition of the data (step (a)) and select an expert for
each partition cell (step (b)). We measure expert performance
(step (c)) using the predictions issued in the context of all data,
i.e. the traditional interpretation of Freund's scheme.

3.1 Loss Bound 

We bound the overhead of MPP over Freund's scheme in
terms of the complexity of the reference partition. We first
state the theorem, and then explain the ingredients. We write
$P^{\mathrm{MPP}}_w(x_{1:T})$ for the probability that MPP assigns to data $x_{1:T}$
(so $-\log P^{\mathrm{MPP}}_w(x_{1:T})$ is MPP's cumulative log loss).

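Theorem 1 ([1, Theorem 7]). For any mixing scheme $\beta$,
Bayesian joint distribution $P^{\mathrm{B}}$ with prior distribution $w$ on
experts, partition $\mathcal{C}$ of $1{:}T$, data $x_{1:T}$ and expert predictions
$p^{\mathcal{E}}_{1:T}$

$$P^{\mathrm{MPP}}_w(x_{1:T}) \ge \beta(\mathcal{C})\, P^{\mathrm{B}}_{\mathcal{C}}(x_{1:T}). \qquad (1)$$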

A mixing scheme $\beta$ is a sequence $\beta_1, \beta_2, \ldots$ of distributions,
where $\beta_{j+1}$ is a probability distribution on $0{:}j$. In [1] several
mixing schemes are listed, e.g. Uniform Past and Decaying
Past. A mixing scheme is turned into a distribution on
partitions as follows. Let $\mathcal{C}$ be a partition of $1{:}T$, and let $i \in 1{:}T$.
The cell of $i$, denoted $\mathcal{C}(i)$, is the unique $C \in \mathcal{C}$ such that
$i \in C$. We write $\mathrm{prev}_{\mathcal{C}}(i)$ for the predecessor of $i$, defined as
the largest element in $\mathcal{C}(i) \cup \{0\}$ that is smaller than $i$. Using
this notation, the distribution on partitions is given by

$$\beta(\mathcal{C}) := \prod_{t \in 1:T} \beta_t(\mathrm{prev}_{\mathcal{C}}(t)).$$

Note that this distribution is potentially defective; two
elements $i < j$ cannot share the same nonzero predecessor, but
$\beta_i$ may assign nonzero probability to $\mathrm{prev}_{\mathcal{C}}(j)$ nonetheless.
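
To make the definition of the partition weight concrete, here is a small sketch that computes predecessors and the resulting product. The representation of a partition as a list of sets, the function names, and the scheme `uniform` (each distribution uniform on its whole domain) are assumptions for illustration only; `uniform` is not one of the schemes from [1].

```python
from math import prod

def prev(cells, t):
    """Largest element of the cell of t, joined with {0}, that is smaller than t."""
    cell = next(c for c in cells if t in c)
    return max(i for i in set(cell) | {0} if i < t)

def partition_weight(cells, beta):
    """Product over t in 1..T of beta_t(prev(t)).

    `cells` is a partition of 1..T given as a list of sets and
    `beta(t)` returns the distribution beta_t on 0..t-1 as a list.
    """
    T = max(max(c) for c in cells)
    return prod(beta(t)[prev(cells, t)] for t in range(1, T + 1))

def uniform(t):
    # Illustrative scheme only: beta_t uniform on 0..t-1.
    return [1.0 / t] * t

cells = [{1, 2, 5}, {3, 4}]
predecessors = [prev(cells, t) for t in range(1, 6)]
weight = partition_weight(cells, uniform)
print(predecessors)  # [0, 1, 0, 3, 2]
print(weight)
```

Each round contributes the probability that its distribution assigns to the predecessor of that round, so with the illustrative uniform scheme the weight of this five-round partition is 1/(1·2·3·4·5) = 1/120.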

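Now that we have seen how the loss bound encodes partitions,
we turn to $P^{\mathrm{B}}_{\mathcal{C}}(x_{1:T})$, the probability of the data $x_{1:T}$
given a particular partition $\mathcal{C}$. To compute it, we treat the
cells independently (2) and per cell we use the Bayesian
mixture with prior $w$ on experts (3), thus mixing the
predictions the experts issued in the context of all data (4).

$$P^{\mathrm{B}}_{\mathcal{C}}(x_{1:T}) := \prod_{C \in \mathcal{C}} P^{\mathrm{B}}(x_C), \quad\text{where} \qquad (2)$$

$$P^{\mathrm{B}}(x_C) := \sum_{e \in \mathcal{E}} w(e)\, P^{e}(x_C) \quad\text{and} \qquad (3)$$

$$P^{e}(x_C) := \prod_{i \in C} p^{e}_i(x_i). \qquad (4)$$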

A second bounding step allows us to relate the performance
of MPP directly to Freund's scheme. Let $w$ be the uniform
prior over a finite set of experts $\mathcal{E}$, and select an expert $e_C$ for
each partition cell $C \in \mathcal{C}$. Then bound each sum (3) from
below by one of its terms to obtain

Corollary 1. $P^{\mathrm{MPP}}_w(x_{1:T}) \ge \beta(\mathcal{C})\, |\mathcal{E}|^{-|\mathcal{C}|} \prod_{C \in \mathcal{C}} P^{e_C}(x_C)$.

Thus the log-loss overhead of MPP over Freund's scheme is
bounded by $-\log \beta(\mathcal{C}) + |\mathcal{C}| \log|\mathcal{E}|$, which can be related
to the number of bits to encode the chosen partition and the
selected experts for each cell [1].
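
Spelled out, the overhead follows from Corollary 1 by taking negative logarithms. In the display below, $e_C$ denotes the expert selected for cell $C$ and $p^{e}_i$ is expert $e$'s prediction in round $i$, issued in the context of all data; the underbrace labels are added for exposition.

```latex
\begin{align*}
-\log P^{\mathrm{MPP}}_w(x_{1:T})
  &\le -\log\Bigl(\beta(\mathcal{C})\,|\mathcal{E}|^{-|\mathcal{C}|}
       \prod_{C\in\mathcal{C}} \prod_{i\in C} p^{e_C}_i(x_i)\Bigr) \\
  &= \underbrace{-\log\beta(\mathcal{C}) + |\mathcal{C}|\log|\mathcal{E}|}_{\text{overhead}}
     \;+\; \underbrace{\sum_{C\in\mathcal{C}}\sum_{i\in C} -\log p^{e_C}_i(x_i)}_{\text{loss of Freund's scheme}} .
\end{align*}
```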

Convex Combinations In [1], the authors make a point
of selecting a convex combination of experts for each
subsequence, where the loss of a convex combination of experts
is the weighted average loss of the experts. The loss of such a
convex combination is therefore never lower than the loss
of its best expert. Uniform bounds in terms of arbitrary
experts, like Corollary 1, apply in particular to the best expert,
and hence to any convex combination. Therefore, without
loss of generality, we do not discuss convex combinations
any further.



Figure 4 Bayesian Network specification of an EHMM 



Figure 5 Hidden state transitions in slot machine HMM 
Interpreting Freund's Problem This loss bound shows
that MPP solves the black-box-experts interpretation of Freund's
problem. This can be seen clearly in (4). To predict the
subsequence $x_C$, it uses predictions $p^e_i$ which were issued in the
context of all data. This means that the experts observe the
entire history $x_{1:i}$ before predicting the next outcome $x_{i+1}$.

Switching between learning experts that observe all data 
is useful when the data are homogeneous, and the experts 
learn its global pattern at different speeds. In such cases 
we want to train each expert on all observations, for then by 
switching at the right time, we can predict each outcome us- 
ing the expert that has learned most until then. This scenario 
is analysed in [14], where experts are parameter estimators
for a series of statistical models of increasing complexity. 

On the other hand, if the data have local patterns then 
our new interpretation of Freund's problem applies, and we 
want to train each expert on the subsequence on which it 
is evaluated, so that it can exploit its local patterns. To solve 
Freund's problem for such learning experts, we need to know
about their internal structure.

4 Structured Experts 

Assume there is only a single expert and fix a reference par- 
tition. Suppose we want to predict as if the expert is restarted 
on each cell of the partition, when in reality the expert just 
makes her predictions as if all the data were in a single cell. 
Then clearly this is impossible if we treat this expert com- 
pletely as a black box: if we do not know what the expert's 
predictions would have been if a certain outcome were, say, 
the start of a new cell, then we cannot match these predic- 
tions. 

The expert therefore needs to reveal to us some of her 
internal state. To this end, we will represent the parts of her 
internal state that will not be revealed to us by lower level 
experts that we will treat as black boxes, and assume our 
main expert combines the predictions of these base experts 
using an expert hidden Markov model (EHMM). 

4.1 EHMMs 

Expert Hidden Markov Models (EHMMs) were introduced
in [9] as a language to specify strategies for prediction with
expert advice. We briefly review them here. An EHMM $\mathfrak{H}$
is a probability distribution that is constructed according to
the Bayesian network in Figure 4. It is used to sequentially
predict outcomes $X_1, X_2, \ldots$ which take values in outcome
space $\mathcal{X}$. At each time $t$, the distribution of $X_t$ depends on
a hidden state $Q_t$, which determines mixing weights for the
experts' predictions. Formally, the production function $p_\downarrow$
determines the interpretation of a state: it maps any state
$q_t \in Q$ to a distribution $p_\downarrow^{q_t}$ on the identity $E_t$ of the expert
that should be used to predict $X_t$. Then given $E_t = e$, the
distribution of $X_t$ is base expert $e$'s prediction $p_t^e$. It remains
to define the distribution of the hidden states. The starting
state $Q_1$ has initial distribution $p_\circ$, and the state evolves
according to the transition function $p_\to$, which maps any state
$q_t$ to a distribution $p_\to^{q_t}$ on states.

An EHMM defines a prediction strategy as follows:
after observing $x_{1:t}$, predict outcome $X_{t+1}$ using the
marginal $\mathfrak{H}(X_{t+1} \mid x_{1:t})$, which is a mixture of the experts'
predictions $p^{\mathcal{E}}_{t+1}$.

Example 4.1 (Any Ordinary HMM). To illustrate how or- 
dinary HMMs are a special case of EHMMs, consider the 
following naive gambler's HMM model of an old-fashioned 
slot machine: in each round the gambler inserts one nickel 
into the slot machine and then the machine pays out a certain 
number of nickels depending on its hidden internal state: in 
state Cold it pays out nothing; in state Hot it pays out an 
amount between one and five nickels, uniformly at random; 
and then there's Jackpot in which it always pays out ten nick- 
els. The machine always starts in state Cold and the state 
transitions are as in Figure 5.

To make an EHMM out of this HMM, we just identify
experts with states: $Q = \mathcal{E} = \{\text{Cold}, \text{Hot}, \text{Jackpot}\}$, $p_\downarrow^{e}(e) =
1$, and each expert predicts according to the corresponding
payout scheme. The distributions on states follow the
original HMM: $p_\circ(\text{Cold}) = 1$ and $p_\to$ as in Figure 5. ◇

Example 4.2 (Bayes on base experts). We identify the
Bayesian distribution with prior $w$ on base experts $\mathcal{E}$ and the
EHMM with $Q = \mathcal{E}$, $p_\circ = w$, and $p_\to^{e}(e) = p_\downarrow^{e}(e) = 1$,
since their marginals coincide. Despite its deceptive
simplicity, this EHMM learns: its marginal distribution on the next
outcome is a mixture of the experts' predictions according to
the Bayesian posterior. ◇

Example 4.3 (Bayes on EHMMs). Fix EHMMs $\mathfrak{H}^1, \ldots, \mathfrak{H}^n$
with disjoint state spaces and the same base experts, and
let $w$ be a prior distribution on $1{:}n$. The Bayesian mixture
EHMM has state space $Q = \bigcup_i Q^i$ and for any two states
$q, q' \in Q^i$ belonging to the same original EHMM, $p_\circ(q) =
w(i)\, p^i_\circ(q)$, $p_\to^{q}(q') = p^{i,q}_\to(q')$ and $p_\downarrow^{q}(e) = p^{i,q}_\downarrow(e)$. Again,
this EHMM learns which of the given EHMMs is the best
predictor. ◇



4.2 The Forward Algorithm 

Sequential predictions for EHMMs can be computed
efficiently using the forward algorithm, which maintains a
posterior distribution over states, and predicts each outcome with
a mixture of the experts' predictions [9]. Given a posterior
$\lambda_t(Q_t) = \mathfrak{H}(Q_t \mid x_{1:t-1})$ for the hidden state at time $t$, the
forward algorithm predicts $x_t$ using the marginal of
$\mathfrak{H}(Q_t, E_t, X_t \mid x_{1:t-1})$. Then, after observing outcome $x_t$, it
updates its posterior $\lambda_t$ for $Q_t$ to a posterior $\lambda_{t+1}$ for $Q_{t+1}$.

Figure 6 Sleeping and Freezing on outcomes $x_{\{2,4,5,\ldots\}}$
(a) Sleeping: EHMM $\mathfrak{H}^{\mathrm{sl}}_{\{2,4,5,\ldots\}}$
(b) Freezing: EHMM $\mathfrak{H}^{\mathrm{fr}}_{\{2,4,5,\ldots\}}$

For finite $Q$, $\mathcal{E}$ and $\mathcal{X}$, the running time of the algorithm
is determined by this last posterior update step, which in
general may require $O(|Q|^2)$ computation steps for each round
$t$. On $T$ outcomes, this gives a total running time of $O(|Q|^2 \cdot
T)$. In Appendix A we provide a more careful analysis.
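
The forward recursion can be sketched as follows for finite state and expert sets. This is an illustrative rendering under assumed encodings (distributions as lists, expert predictions as dicts); the function and argument names are hypothetical and do not come from [9].

```python
def forward_predict(initial, transition, production, expert_preds, outcomes):
    """Probabilities that an EHMM assigns to each observed outcome.

    initial[q]         : initial state probability
    transition[q][r]   : probability of moving from state q to state r
    production[q][e]   : weight of expert e in state q
    expert_preds[t][e] : dict mapping outcomes to expert e's probabilities
    Assumes every observed outcome receives positive probability.
    """
    n = len(initial)
    lam = list(initial)  # posterior on the current hidden state
    probs = []
    for t, x in enumerate(outcomes):
        # Likelihood of x in each state: production-weighted expert mixture.
        like = [sum(w * expert_preds[t][e][x]
                    for e, w in enumerate(production[q]))
                for q in range(n)]
        p_x = sum(lam[q] * like[q] for q in range(n))
        probs.append(p_x)
        # Condition on x, then take one transition step: O(|Q|^2) work.
        cond = [lam[q] * like[q] / p_x for q in range(n)]
        lam = [sum(cond[q] * transition[q][r] for q in range(n))
               for r in range(n)]
    return probs

# The Bayes EHMM of Example 4.2 on two constant experts.
identity = [[1.0, 0.0], [0.0, 1.0]]
experts = [{1: 0.9, 0: 0.1}, {1: 0.1, 0: 0.9}]
outcomes = [1, 1, 0]
probs = forward_predict([0.5, 0.5], identity, identity,
                        [experts] * len(outcomes), outcomes)
```

For the Bayes EHMM the product of the returned probabilities is the Bayesian marginal likelihood, here 0.5 · 0.081 + 0.5 · 0.009 = 0.045. The nested sum in the last step is where the quadratic cost per round arises.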

5 Freezing & Sleeping 

Let $x_{1:T} = x_1, \ldots, x_T$ be a sequence of data and suppose
that a reference partition $\mathcal{C}$ of $1{:}T$ is given in advance. We
are interested in the performance of a structured expert $\mathfrak{H}_{\mathcal{C}}$
which for each cell $C \in \mathcal{C}$ runs a separate instance of the
structured expert $\mathfrak{H}$ on the subsequence $x_C$. This leaves
unspecified, however, whether the original timing of $x_C$ should
be preserved when $x_C$ is presented to $\mathfrak{H}$. This is a
modelling choice, which depends on the application at hand. We
therefore treat both the case where the timing is preserved,
which we call sleeping, and the case where the timing is not
preserved, which we call freezing. (See also Figure 2 in the
introduction.)
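
The two presentations can be contrasted in a short sketch that computes the probability an EHMM assigns to the outcomes of a single cell. The encodings and names below are assumptions for illustration. Following Figure 6, the sleeping instance starts its clock at round 1 and keeps making transition steps on rounds outside the cell, while the frozen instance starts in the initial distribution at the first round of the cell and makes transition steps only on rounds inside the cell. In both cases the base experts' predictions are taken at their original times.

```python
def cell_probability(cell, mode, initial, transition, production,
                     expert_preds, data):
    """Probability assigned to the outcomes in `cell` (sketch).

    mode is "sleeping" or "freezing". Rounds are 1-based;
    expert_preds[t] and data[t] refer to round t.
    """
    n = len(initial)
    lam = list(initial)
    prob = 1.0
    for t in range(1, max(cell) + 1):
        if t in cell:
            like = [sum(w * expert_preds[t][e][data[t]]
                        for e, w in enumerate(production[q]))
                    for q in range(n)]
            p_x = sum(l * k for l, k in zip(lam, like))
            prob *= p_x
            lam = [l * k / p_x for l, k in zip(lam, like)]
        elif mode == "freezing":
            continue  # frozen: no state change outside the cell
        # One transition step (every round for sleeping, in-cell for freezing).
        lam = [sum(lam[q] * transition[q][r] for q in range(n))
               for r in range(n)]
    return prob

# Hypothetical two-state EHMM that deterministically alternates states;
# state q always uses expert q.
flip = [[0.0, 1.0], [1.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
experts = [{1: 0.9, 0: 0.1}, {1: 0.2, 0: 0.8}]
preds = {t: experts for t in range(1, 6)}
data = {2: 1, 4: 1, 5: 1}
cell = {2, 4, 5}
p_sleep = cell_probability(cell, "sleeping", [1.0, 0.0], flip, identity, preds, data)
p_freeze = cell_probability(cell, "freezing", [1.0, 0.0], flip, identity, preds, data)
```

In this toy instance the sleeping expert is in the second state at rounds 2 and 4 and in the first state at round 5, giving 0.2 · 0.2 · 0.9, whereas the frozen expert visits the states in order first, second, first, giving 0.9 · 0.2 · 0.9.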

Sleeping We say that the instance of $\mathfrak{H}$ that is used to
predict cell $C$ is sleeping if it does notice the passing of time
during outcomes outside of $C$, even though it does not
observe them. We write $\mathfrak{H}^{\mathrm{sl}}_C$ for the resulting EHMM, which is
shown in Figure 6a for the example $C = \{2, 4, 5, \ldots\}$.
Notice that $\mathfrak{H}^{\mathrm{sl}}_C$ contains all five states $Q_{1:5}$, even though it does
not observe $x_1$ or $x_3$. This has the effect that state transitions
from e.g. $Q_2$ to $Q_4$ are composed of two transition steps
according to $p_\to$. The distributions on individual cells combine
into the following distribution on all data $x_{1:T}$:

$$\mathfrak{H}^{\mathrm{sl}}_{\mathcal{C}}(x_{1:T}) := \prod_{C \in \mathcal{C}} \mathfrak{H}^{\mathrm{sl}}_C(x_C).$$



To memorize the nature of sleeping, one may think of the
way television channels get interleaved as you zap between
them: a channel not being watched is not paused, but
instead continues broadcasting even when its content is not
observed.

Freezing In freezing, the instance of $\mathfrak{H}$ that is used to
predict cell $C \in \mathcal{C}$ is frozen when outcomes outside of $C$ occur:
its internal state should not change based on those outcomes.
(Of course we have no control over the base experts on which
$\mathfrak{H}$ is based, so they may do whatever they please with such
data. We therefore do have to preserve the timing of the base
experts' predictions.) The resulting EHMM $\mathfrak{H}^{\mathrm{fr}}_C$ is shown for
the example $C = \{2, 4, 5, \ldots\}$ in Figure 6b. Note that $Q_2$,
$Q_4$ and $Q_5$ are the first, second and third state of $\mathfrak{H}^{\mathrm{fr}}_C$; state
transitions between them consist of a single transition step
according to $p_\to$. The resulting distribution on all data is
defined by

$$\mathfrak{H}^{\mathrm{fr}}_{\mathcal{C}}(x_{1:T}) := \prod_{C \in \mathcal{C}} \mathfrak{H}^{\mathrm{fr}}_C(x_C).$$

One might associate freezing with the way different e-mail
conversations get interleaved in your inbox (if it is sorted by
order of message arrival): a conversation about your latest
research is paused (remains frozen) regardless of how much
spam you receive in between.

5.1 An Infeasible Solution

The freezing or sleeping distributions can be computed if the
reference partition $\mathcal{C}$ is given in advance. The problem we
are addressing, however, is that we do not assume $\mathcal{C}$ to be
known. An easy (but impractical) solution to this problem is
to predict according to the Bayesian mixture of all possible
partitions: let $w$ be a prior on the set of all possible partitions
and predict such that the joint distribution on all data is given
by

$$\mathfrak{B}(x_{1:T}) := \sum_{\mathcal{C}} w(\mathcal{C})\, \mathfrak{H}^{\mathrm{f/s}}_{\mathcal{C}}(x_{1:T}),$$

where f/s denotes either fr for freezing or sl for sleeping.
Lower bounding the sum by the term for the reference
partition $\mathcal{C}$ directly gives an upper bound on the log loss:

$$-\log \mathfrak{B}(x_{1:T}) \le -\log w(\mathcal{C}) - \log \mathfrak{H}^{\mathrm{f/s}}_{\mathcal{C}}(x_{1:T}).$$

To predict according to $\mathfrak{B}$ in general would require an
exponential amount of state to keep track of all possible partitions,
which is completely impractical. In the following section we
therefore present generalisations to both sleeping and
freezing of the mixing past posteriors algorithm and show that
their running time is comparable to that of the forward
algorithm on $\mathfrak{H}$ itself. Then in §5.3 we prove bounds
that relate the additional loss to the encoding cost of the
reference partition $\mathcal{C}$.
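
To get a feeling for why the mixture over all partitions is impractical, one can count the partitions of $1{:}T$; their number is the Bell number of $T$. The sketch below uses the standard Bell triangle recurrence; the function name is an assumption for illustration.

```python
def count_partitions(T):
    """Number of partitions of 1..T (the Bell number), via the Bell triangle."""
    row = [1]
    for _ in range(T - 1):
        nxt = [row[-1]]          # each row starts with the last entry of the previous row
        for v in row:
            nxt.append(nxt[-1] + v)
        row = nxt
    return row[-1]

for T in (5, 10, 20):
    print(T, count_partitions(T))
```

Already for ten rounds there are more than a hundred thousand partitions, so maintaining a separate set of cell instances per partition is out of the question for realistic sequence lengths.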

5.2 The EPP Algorithm 

Here we present a generalisation of the mixing past posteri- 
ors (MPP) algorithm, which we call evolving past posteri- 
ors (EPP). It is based on the view that MPP internally uses 
the Bayesian mixture of base experts, which is a standard 
EHMM. Given this perspective and after making the dis- 
tinction between sleeping and freezing, the generalisation to 
other EHMMs is straightforward. We will discuss the
connections between MPP and EPP in more detail in §5.4.



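Algorithm 1 EPP: Evolving Past Posteriors
Input:

• An EHMM $\mathfrak{H}$ with components $p_\circ$, $p_\to$ and $p_\downarrow$ (see §4.1)

• A mixing scheme $\beta_1, \beta_2, \ldots$ (see §3.1 and §5.2.2)

• Expert predictions $p^{\mathcal{E}}_1, p^{\mathcal{E}}_2, \ldots$ and data $x_1, x_2, \ldots$

Output: Predictions $p^{\mathrm{alg}}_1, p^{\mathrm{alg}}_2, \ldots$

Storage: Past posteriors $\pi_1, \pi_2, \ldots$ on $Q$, the states of $\mathfrak{H}$
Algorithm

1: Set the first posterior to the initial distribution of $\mathfrak{H}$:

   $\pi_1(q_1) \leftarrow p_\circ(q_1)$

2: for $t = 1, 2, \ldots$ do

3:   Form $\lambda_t$, the current configuration, as the $\beta_t$-mixture
     of past posteriors:

     $\lambda_t(q_t) \leftarrow \sum_{0 \le j < t} \beta_t(j)\, \pi_{j+1}(q_t)$

4:   Compute $p^{\mathrm{alg}}_t$, the joint distribution on states,
     experts and outcomes:

     $p^{\mathrm{alg}}_t(q_t, e_t, x_t) \leftarrow \lambda_t(q_t)\, p_\downarrow^{q_t}(e_t)\, p_t^{e_t}(x_t)$

5:   Predict $x_t$ using the marginal $p^{\mathrm{alg}}_t(x_t)$.

6:   Observe $x_t$. Suffer log loss

     $\ell^{\mathrm{alg}}_t \leftarrow -\log(p^{\mathrm{alg}}_t(x_t))$

7:   Perform loss update and state evolution to obtain
     the next posterior:

     $\pi_{t+1}(q_{t+1}) \leftarrow \sum_{q_t \in Q} p^{\mathrm{alg}}_t(q_t \mid x_t)\, p_\to^{q_t}(q_{t+1})$

8:   Only for sleeping: perform state evolution for all
     past posteriors ($1 \le j \le t$):

     $\pi_j(q_{t+1}) \leftarrow \sum_{q_t \in Q} \pi_j(q_t)\, p_\to^{q_t}(q_{t+1})$

9: end for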



The EPP algorithm has variants for sleeping and freezing,
which are both given in Algorithm 1. It takes an EHMM
$\mathfrak{H}$ and mixing scheme $\beta$ (see §3.1) as input. Given a
distribution $\lambda_t$ on the hidden state $Q_t$ at time $t$, the EPP
algorithm predicts $X_t$ exactly like the forward algorithm. It
differs from the forward algorithm, however, in the way it
computes $\lambda_t$. Whereas in the forward algorithm $\lambda_t$ may be
interpreted as the posterior distribution on $Q_t$, in the EPP
algorithm $\lambda_t$ is a $\beta$-mixture of the algorithm's own past
posteriors. This recursive nature of EPP, which it inherits from
the MPP algorithm, makes it hard to analyse.
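
A direct, unoptimised rendering of Algorithm 1 for a finite state space may help to see how the two variants differ only in line 8. The sketch below is an illustration under assumed encodings (distributions as lists, expert predictions as dicts); the names `epp`, `evolve` and `latest` are hypothetical, and no attempt is made at the efficient implementations discussed in §5.2.3.

```python
import math

def epp(initial, transition, production, beta, expert_preds, outcomes, sleeping):
    """Cumulative log loss of EPP (sketch of Algorithm 1 for finite Q).

    beta(t) returns the weights beta_t(0), ..., beta_t(t-1); entry j weights
    the past posterior pi_{j+1}. `sleeping` selects EPP-Sleeping (True) or
    EPP-Freezing (False). Assumes observed outcomes get positive probability.
    """
    n = len(initial)

    def evolve(dist):  # one transition step
        return [sum(dist[q] * transition[q][r] for q in range(n))
                for r in range(n)]

    past = [list(initial)]                                  # line 1: pi_1
    total = 0.0
    for t, x in enumerate(outcomes, start=1):
        w = beta(t)
        lam = [sum(w[j] * past[j][q] for j in range(t))     # line 3
               for q in range(n)]
        like = [sum(pw * expert_preds[t - 1][e][x]
                    for e, pw in enumerate(production[q]))
                for q in range(n)]
        p_x = sum(l * k for l, k in zip(lam, like))         # lines 4-5
        total -= math.log(p_x)                              # line 6
        cond = [l * k / p_x for l, k in zip(lam, like)]
        new = evolve(cond)                                  # line 7
        if sleeping:
            past = [evolve(p) for p in past]                # line 8
        past.append(new)
    return total

def latest(t):
    # Mixing scheme that always continues with the most recent posterior.
    return [0.0] * (t - 1) + [1.0]

# Hypothetical alternating two-state EHMM; state q always uses expert q.
flip = [[0.0, 1.0], [1.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
experts = [{1: 0.9, 0: 0.1}, {1: 0.2, 0: 0.8}]
outcomes = [1, 1, 1]
preds = [experts] * len(outcomes)
loss_sl = epp([1.0, 0.0], flip, identity, latest, preds, outcomes, True)
loss_fr = epp([1.0, 0.0], flip, identity, latest, preds, outcomes, False)
```

With the `latest` scheme the mixture in line 3 always selects the newest posterior, so both variants coincide with the forward algorithm on this EHMM; differences between sleeping and freezing only appear once older posteriors receive weight and the transition function is not the identity.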

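We denote by $P^{\mathrm{fr}}_{\mathfrak{H}}$ and $P^{\mathrm{sl}}_{\mathfrak{H}}$ the probability distributions
on $(Q_t, E_t, X_t)_{t \in \mathbb{N}}$ defined by EPP-Freezing and
EPP-Sleeping on EHMM $\mathfrak{H}$ and mixing scheme $\beta$. For both
f/s $\in$ {sl, fr}

$$P^{\mathrm{f/s}}_{\mathfrak{H}}(q_{1:T}, e_{1:T}, x_{1:T}) := \prod_{t \in 1:T} p^{\mathrm{alg}}_t(q_t, e_t, x_t).$$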



Table 1 Mixing schemes 
5.2.1 Representation Invariance

Let $\mathfrak{H}^1$ and $\mathfrak{H}^2$ be EHMMs that are based on the same set
of experts $\mathcal{E}$, but have different state spaces. We call $\mathfrak{H}^1$ and
$\mathfrak{H}^2$ equivalent if $\mathfrak{H}^1(e_{1:T}) = \mathfrak{H}^2(e_{1:T})$ for all $e_{1:T}$.
Consequently, equivalent EHMMs assign the same probability
$\mathfrak{H}^1(x_{1:T}) = \mathfrak{H}^2(x_{1:T})$ to all data $x_{1:T}$, hence the difference
between $\mathfrak{H}^1$ and $\mathfrak{H}^2$ is merely a matter of representation. As
an important sanity check, we need to verify that EPP on
either EHMM issues the same predictions.

Theorem 2 (Invariance). Let f/s denote either fr or sl. Fix
equivalent EHMMs $\mathfrak{H}^1$ and $\mathfrak{H}^2$. Then for all data $x_{1:T}$

$$P^{\mathrm{f/s}}_{\mathfrak{H}^1}(x_{1:T}) = P^{\mathrm{f/s}}_{\mathfrak{H}^2}(x_{1:T}).$$

Proof. Given in Appendix C. □



Thus, from the perspective of predictive performance, the
difference between $\mathfrak{H}^1$ and $\mathfrak{H}^2$ is irrelevant. Of course, it
does matter for the computational cost of EPP, see §5.2.3.



5.2.2 Mixing Schemes 

Bousquet and Warmuth [1] provide an extensive discussion
of possible mixing schemes. Their loss bounds for various
schemes carry over directly to our setting. It is interesting,
however, to analyse the running times of the Fixed-Share to
uniform past and to decaying past mixing schemes for EPP.
For further information we refer the reader to [1].

Both schemes (see Table 1) depend on a switching rate
$\alpha \in [0, 1]$, which determines whether to continue with
yesterday's posterior or switch back to an earlier one: $\beta_{t+1}(t) =
1 - \alpha$ and $\sum_{0 \le j < t} \beta_{t+1}(j) = \alpha$.

Uniform Past Given the choice to switch back, the
uniform past mixing scheme gives equal weights to the entire
past: $\beta_{t+1}(j) = \alpha / t$ for $0 \le j < t$.

Decaying Past Instead, the decaying past scheme assigns
larger weight to the recent past: $\beta_{t+1}(j) = \alpha (t - j)^{-\gamma} / Z_t$
for $0 \le j < t$, where $Z_t = \sum_{0 \le j < t} (t - j)^{-\gamma}$ is a
normalising constant and $\gamma \ge 0$ is a parameter that determines the
rate of decay.
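
Both Fixed-Share schemes are easy to tabulate. The sketch below returns the distribution for round $t+1$ as a list indexed by $0, \ldots, t$; the function names are illustrative, and the treatment of the very first round (all weight on the initial distribution, since no earlier posterior exists) is an assumption made so that the weights sum to one.

```python
def uniform_past(t, alpha):
    """Weights beta_{t+1}(0..t) for Fixed-Share to uniform past."""
    if t == 0:
        return [1.0]  # assumption: only the initial distribution is available
    return [alpha / t] * t + [1.0 - alpha]

def decaying_past(t, alpha, gamma):
    """Weights beta_{t+1}(0..t) for Fixed-Share to decaying past."""
    if t == 0:
        return [1.0]  # assumption: only the initial distribution is available
    raw = [(t - j) ** -gamma for j in range(t)]
    z = sum(raw)      # normalising constant Z_t
    return [alpha * r / z for r in raw] + [1.0 - alpha]

print(uniform_past(4, 0.2))
print(decaying_past(4, 0.2, 1.0))
```

Setting the decay parameter to zero makes all powers equal to one, so decaying past then coincides with uniform past; larger values shift the switching mass towards recent posteriors.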

5.2.3 Running Times 

Appendix A provides a detailed comparison of the running
times and space requirements of EPP and the forward
algorithm. The upshot is that for the uniform past mixing
scheme the sleeping variant of EPP is as efficient as the
forward algorithm, in terms of both running time and space
requirements; the freezing variant is equally efficient if the set
of hidden states $Q$ is finite, but may be a factor $O(T)$ less
efficient on $T$ outcomes for countably infinite $Q$. The
decaying past mixing scheme is a factor $O(T)$ less efficient (for
both time and space) than uniform past in all cases, but may
be approximated by a scheme described in [1] that reduces
this factor to $O(\log T)$.

5.3 Loss Bound 

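This bound relates the performance of EPP-Freezing and
EPP-Sleeping (defined in Algorithm 1) to that of $\mathfrak{H}^{\mathrm{fr}}_{\mathcal{C}}$ and
$\mathfrak{H}^{\mathrm{sl}}_{\mathcal{C}}$ for all partitions $\mathcal{C}$ jointly.

Theorem 3 (EPP Loss Bounds). For both f/s $\in$ {fr, sl} and
any mixing scheme $\beta$, data $x_{1:T}$ and expert predictions $p^{\mathcal{E}}_{1:T}$

$$P^{\mathrm{f/s}}_{\mathfrak{H}}(x_{1:T}) \ge \sum_{\mathcal{C}} \beta(\mathcal{C})\, \mathfrak{H}^{\mathrm{f/s}}_{\mathcal{C}}(x_{1:T}). \qquad (5)$$

Proof. Given in Appendix B. □

Using this bound, we can relate the predictive performance
of EPP-Sleeping and EPP-Freezing to that of $\mathfrak{H}^{\mathrm{sl}}_{\mathcal{C}}$ and
$\mathfrak{H}^{\mathrm{fr}}_{\mathcal{C}}$ for any reference partition $\mathcal{C}$.

Corollary 2. $P^{\mathrm{f/s}}_{\mathfrak{H}}(x_{1:T}) \ge \beta(\mathcal{C})\, \mathfrak{H}^{\mathrm{f/s}}_{\mathcal{C}}(x_{1:T})$.

From the brutal way in which Corollary 2 was obtained, we
may expect to often do much better in practice; many
partitions may contribute significantly to (5).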

5.4 Recovering MPP 

We now substantiate our claim that EPP generalises MPP by
proving that MPP results from running EPP-Freezing or
EPP-Sleeping on the Bayesian EHMM (Example 4.2).



Theorem 4. Let $\mathfrak{H}$ be the Bayesian EHMM with initial
distribution $w$, and let $P^{\mathrm{MPP}}_w$ denote the probability distribution
defined by MPP with prior $w$. Then for all data $x_{1:T}$

$$P^{\mathrm{fr}}_{\mathfrak{H}}(x_{1:T}) = P^{\mathrm{sl}}_{\mathfrak{H}}(x_{1:T}) = P^{\mathrm{MPP}}_w(x_{1:T}).$$

Proof. The difference between freezing and sleeping (line 8)
evaporates since state evolution is the identity operation. By
identifying states and experts the MPP algorithm [1, Figure
1] remains. □

The theorem does not require the set of experts $\mathcal{E}$ to be
finite. If $\mathcal{E}$ is infinite (or too large), MPP is intractable. Still,
a small EHMM may exist that implements Bayes (say with
the uniform prior) on $\mathcal{E}$, and we can use EPP-Sleeping
(which is faster than EPP-Freezing) for sequential
prediction. For example, we may implement MPP on the infinite
set of Bernoulli experts (cf. the example in the introduction)
efficiently, in time $O(T^2)$, using EPP-Sleeping on the
universal element-wise mixture EHMM of [9, §4.1].

5.4.1 Improved MPP Loss Bound 

[1, Theorem 7] (our Theorem 1) bounds the overhead of
MPP over Freund's scheme in terms of $\beta(C)$, the complexity
of the reference partition $C$ according to the mixing scheme
$\beta$. A more general bound follows directly from Theorems 3
and 4.



Corollary 3. $P^{\mathrm{MPP}}_\beta(x_{1:T}) \ge \sum_{C} \beta(C)\, P_C(x_{1:T})$.

Even with a fixed reference partition C in mind, we get a bet- 
ter bound by considering small modifications of C, e.g. finer 
partitions or partitions that disagree about a single round. 



Adversarial Experts For each number of rounds T one
can construct a set of T base experts and data $x_{1:T}$ such that
the loss of Freund's scheme under the MPP interpretation is
infinite for all partitions except the finest one. We simply
have expert t suffer infinite loss in all rounds other than t.
In this pathological case the bounds in Theorem 1 for that
partition and Corollary 3 are equal and tight.

5.4.2 Is EPP strictly more general than MPP? 

A natural question is whether either EPP-SLEEPING or EPP- 
FREEZING can be implemented using MPP on a rich set of 
meta-experts. To preclude the trivial answer that regards 
either algorithm as a single meta-expert, we ask for a fixed 
construction that works for all mixing schemes. 

Sleeping For any EHMM $\mathfrak{H}$, EPP-Sleeping can be reduced
to MPP on meta-experts. Let the set of meta-experts
be $Q^\infty$, the set of paths through the hidden states of $\mathfrak{H}$. Each
meta-expert predicts $x_t$ using the $p_\downarrow$-mixture of base expert
predictions. We set the prior $w$ in MPP equal to the
marginal probability measure of $\mathfrak{H}$ on paths (as determined
by $p_\circ$ and $p_\to$). We omit the proof that the predictions made
by MPP on these meta-experts with prior $w$ are equal to those
made by EPP on $\mathfrak{H}$.

Freezing The next example shows that EPP-Freezing
really is more general than MPP. Fix two experts $\Xi = \{a, b\}$.
Consider the EHMM $\mathfrak{H}$ that predicts the first outcome using
expert a, and the second outcome using expert b, i.e. $Q = \Xi$,
and $p_\circ(a) = p_\to(b \mid a) = p_\downarrow(q \mid q) = 1$. Running EPP-Freezing
on $\mathfrak{H}$ results in $\pi_2(b) = \pi_1(a) = 1$, so that the first outcome
is predicted using expert a, and the second outcome is predicted
using the $\beta_2$-mixture of experts. Thus any candidate
meta-expert must predict the first outcome using base expert
a. But that means that for MPP with prior $w$ on meta-experts,
the loss update has no effect, so that $w = \pi_1 = \pi_2 = \lambda_2$.
Hence the second outcome will be predicted according to the
prior mixture of experts. Since $\beta_2$ is arbitrary and $w$ is fixed,
there can be no general scheme to reduce EPP-Freezing to
MPP.

6 Other Loss Functions 

We will now show how the EPP algorithm for logarithmic 
loss can be directly translated into an algorithm with corres- 
ponding loss bound for any other mixable loss function. The 
same construction works for any logarithmic loss algorithm 
that predicts according to a mixture of the experts' predic- 
tions at each trial and whose predictions only depend on the 
experts' past losses on outcomes that actually occurred. 

Mixability A loss function $\ell : A \times X \to [0, \infty]$ is called
$\eta$-mixable for $\eta > 0$ if any distribution $p$ on experts $\Xi$ can
be mapped to a single action $\mathrm{Pred}(p) \in A$ in a way that
guarantees that

$$\ell(\mathrm{Pred}(p), x) \;\le\; -\frac{1}{\eta} \log \mathbb{E}_{\xi \sim p}\, \exp\bigl(-\eta\, \ell(a^{\xi}, x)\bigr) \qquad (6)$$

for all outcomes $x \in X$ and expert predictions $a^{\Xi}$. It is called
mixable if it is $\eta$-mixable for some $\eta > 0$ [2]. Mixability
ensures that expert predictions for $\ell$ loss can be mixed in
essentially the same way as for log loss.



For example, logarithmic loss itself is 1-mixable. And
for $A = [0, 1]$ and $X = \{0, 1\}$ the square loss $\ell(a, x) := (a - x)^2$
is 2-mixable and the Hellinger loss
$\ell(a, x) := \bigl((\sqrt{1 - x} - \sqrt{1 - a})^2 + (\sqrt{x} - \sqrt{a})^2\bigr)/2$
is $\sqrt{2}$-mixable.
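The mixability inequality (6) can be checked numerically for the square loss with $\eta = 2$. The sketch below is an illustration, not part of the paper: the function names are hypothetical, and the particular choice of Pred (the midpoint rule `0.5 + (g0 - g1) / 2` applied to the two mix losses) is one standard substitution function that we assume here; the text only requires that some valid Pred exists.

```python
import math
import random

ETA = 2.0  # mixability constant stated in the text for square loss


def square_loss(a, x):
    return (a - x) ** 2


def mix_loss(p, actions, x, eta=ETA):
    """Right-hand side of (6): -(1/eta) * log E_p[exp(-eta * loss)]."""
    total = sum(w * math.exp(-eta * square_loss(a, x))
                for w, a in zip(p, actions))
    return -math.log(total) / eta


def pred(p, actions, eta=ETA):
    """Assumed substitution function for binary outcomes (see lead-in)."""
    g0 = mix_loss(p, actions, 0, eta)
    g1 = mix_loss(p, actions, 1, eta)
    return 0.5 + (g0 - g1) / 2


def check_mixability(trials=1000, n_experts=5, seed=0):
    """Return True if (6) held on every random instance tried."""
    rng = random.Random(seed)
    for _ in range(trials):
        raw = [rng.random() for _ in range(n_experts)]
        p = [r / sum(raw) for r in raw]
        actions = [rng.random() for _ in range(n_experts)]
        a = pred(p, actions)
        for x in (0, 1):
            # Small tolerance for floating-point rounding.
            if square_loss(a, x) > mix_loss(p, actions, x) + 1e-12:
                return False
    return True
```

A random check of this kind does not prove mixability; it only gives a quick sanity test of a candidate Pred before plugging it into a prediction strategy.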

The Benefits of Lying Given data $x_{1:t}$ and expert predictions
$a^{\Xi}_{1:t}$, let $\ell^{\xi}_{1:t} := \ell(a^{\xi}_1, x_1), \ldots, \ell(a^{\xi}_t, x_t)$ denote the
sequence of losses of expert $\xi$, and let $\ell^{\Xi}_{1:t}$ denote these losses
jointly for all experts. In the special case that $\ell$ is the logarithmic
loss we write $\ell\ell^{\xi}_{1:t}$ and $\ell\ell^{\Xi}_{1:t}$, respectively.

Suppose Alg is an algorithm for log loss that predicts
each outcome $x_t$ by mixing the experts' predictions $p^{\Xi}_t$ according
to the distribution $p^{\mathrm{alg}}_t[x_{<t}, \ell\ell^{\Xi}_{<t}]$ on experts. The
square-bracket expression indicates that $p^{\mathrm{alg}}_t$ may depend on
the past outcomes $x_{1:t-1}$ and the losses of the experts on
these outcomes, but not on the experts' past or current predictions
in any other way. Following this convention, the
algorithm predicts $x_t$ using:

$$p^{\mathrm{alg}}_t[x_{<t}, \ell\ell^{\Xi}_{<t}](x_t) := \sum_{\xi} p^{\mathrm{alg}}_t[x_{<t}, \ell\ell^{\Xi}_{<t}](\xi)\, p^{\xi}_t(x_t).$$

Now for any game with $\eta$-mixable loss $\ell$ and the same set of
experts $\Xi$, we can derive from Alg an algorithm $\mathrm{Alg}^{\ell}_{\eta}$ that
predicts $x_t$ according to

$$a^{\mathrm{alg}}_t := \mathrm{Pred}\bigl(p^{\mathrm{alg}}_t[x_{<t}, \eta\,\ell^{\Xi}_{<t}]\bigr).$$

Note that $\mathrm{Alg}^{\ell}_{\eta}$ is lying to Alg: while Alg thinks it is playing
a game for log loss in which experts have incurred log
losses $\eta\,\ell^{\Xi}_{<t}$, in reality $\mathrm{Alg}^{\ell}_{\eta}$ is playing a game for loss $\ell$ and
is feeding Alg fake inputs and redirecting Alg's outputs.
Let us now analyse the loss of the derived algorithm $\mathrm{Alg}^{\ell}_{\eta}$.

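Lemma 1 (Other Loss Functions). Suppose Alg is an algorithm
for logarithmic loss that predicts according to
$p^{\mathrm{alg}}_t[x_{<t}, \ell\ell^{\Xi}_{<t}]$ at each time t, $\ell$ is an $\eta$-mixable loss function,
and $f(x_{1:T}, \ell^{\Xi}_{1:T})$ is an arbitrary function that maps
outcomes and expert losses to real numbers. Then any log
loss bound for Alg of the form

$$-\log P^{\mathrm{alg}}(x_{1:T}) \le f(x_{1:T}, \ell\ell^{\Xi}_{1:T}) \quad \text{for all } p^{\Xi}_{1:T}, \qquad (7)$$

directly implies the $\ell$ loss bound for $\mathrm{Alg}^{\ell}_{\eta}$:

$$\ell(a^{\mathrm{alg}}_{1:T}, x_{1:T}) \le \frac{1}{\eta}\, f(x_{1:T}, \eta\,\ell^{\Xi}_{1:T}) \quad \text{for all } a^{\Xi}_{1:T}. \qquad (8)$$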

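Proof. Construct a log loss game in which at any time t each
expert $\xi$ predicts according to a distribution $p^{\xi}_t$ such that
$p^{\xi}_t(x_t) = \exp(-\eta\,\ell^{\xi}_t)$ for the actual outcome $x_t$ and $p^{\xi}_t$ is
arbitrary on other outcomes such that $\sum_{x_t} p^{\xi}_t(x_t) = 1$. In
this game the log loss of Alg is

$$-\log P^{\mathrm{alg}}(x_{1:T}) = \sum_{t \in 1:T} -\log p^{\mathrm{alg}}_t[x_{<t}, \eta\,\ell^{\Xi}_{<t}](x_t).$$

By $\eta$-mixability of $\ell$

$$\ell(a^{\mathrm{alg}}_{1:T}, x_{1:T}) = \sum_{t \in 1:T} \ell\bigl(\mathrm{Pred}(p^{\mathrm{alg}}_t[x_{<t}, \eta\,\ell^{\Xi}_{<t}]), x_t\bigr) \le \frac{1}{\eta} \sum_{t \in 1:T} -\log p^{\mathrm{alg}}_t[x_{<t}, \eta\,\ell^{\Xi}_{<t}](x_t). \qquad (9)$$

Combining this with (7) and (9) completes the proof. □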
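The construction in Lemma 1 can be sketched for the simplest log-loss algorithm, Bayes with a uniform prior over $n$ experts, whose log-loss bound is the best expert's loss plus $\log n$. The code below is a minimal illustration under stated assumptions: square loss with $\eta = 2$ on binary outcomes, a hypothetical substitution function `pred`, and hypothetical function names throughout. It feeds Bayes the scaled losses $\eta\,\ell$ as if they were log losses and checks the bound predicted by (8), namely best expert loss plus $(\log n)/\eta$.

```python
import math
import random

ETA = 2.0  # assumed mixability constant for square loss


def square_loss(a, x):
    return (a - x) ** 2


def pred(weights, actions, eta=ETA):
    """Assumed substitution function for square loss on outcomes {0, 1}."""
    g = [-math.log(sum(w * math.exp(-eta * square_loss(a, x))
                       for w, a in zip(weights, actions))) / eta
         for x in (0, 1)]
    return 0.5 + (g[0] - g[1]) / 2


def bayes_posterior(prior, log_losses):
    """Bayes for log loss: weights proportional to prior * exp(-log loss)."""
    unnorm = [w * math.exp(-loss) for w, loss in zip(prior, log_losses)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def lying_bayes(expert_actions, outcomes, eta=ETA):
    """Run the derived algorithm; return its loss and the experts' losses."""
    n = len(expert_actions[0])
    prior = [1.0 / n] * n
    true_losses = [0.0] * n
    alg_loss = 0.0
    for actions, x in zip(expert_actions, outcomes):
        # The "lie": present eta-scaled losses to Bayes as log losses.
        fake_log_losses = [eta * loss for loss in true_losses]
        weights = bayes_posterior(prior, fake_log_losses)
        alg_loss += square_loss(pred(weights, actions, eta), x)
        true_losses = [loss + square_loss(a, x)
                       for loss, a in zip(true_losses, actions)]
    return alg_loss, true_losses


rng = random.Random(1)
T, n = 50, 4
actions = [[rng.random() for _ in range(n)] for _ in range(T)]
outcomes = [rng.randint(0, 1) for _ in range(T)]
alg_loss, expert_losses = lying_bayes(actions, outcomes)
bound = min(expert_losses) + math.log(n) / ETA
```

Only the experts' losses on realised outcomes enter `bayes_posterior`, which is the condition the lemma places on Alg.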



Algorithms that satisfy the requirements of the lemma 
include Bayes, follow the (perturbed) leader, the forward al- 
gorithm, MPP and EPP. An algorithm that does not satisfy 
them is the last-step minimax algorithm [13], because it takes
into account the experts' predictions on outcomes that do not 
occur. 

In the literature it is common to construct algorithms for 
arbitrary mixable losses and point out their probabilistic interpretation
for the special case of log loss [11, 8, 1]. Instead,
we have proceeded the other way around: first we derived 
results for log loss and then we showed that they general- 
ise to other losses. This allowed us to draw on concepts and 
results from probability theory like conditional probabilities, 
HMMs and the forward algorithm, without reproving them 
in a more general setting. 

Lemma 1 generalises results by Vovk [11], who shows
that the most important loss bounds for Bayes with logarithmic
loss can actually also be derived for arbitrary mixable
losses. Our algorithm Alg plays a role similar to his APA
algorithm.

7 Discussion 

Relearning vs Continuing to Learn Corollary 2 bounds
the regret of EPP with respect to a reference partition $C$
by $-\log \beta(C)$. Consider the asymptotic behaviour of this
bound if $C$ has infinitely many shifts. (A shift occurs when
$\mathrm{prev}(t + 1) \ne t$.) For both decaying past with $\gamma \le 1$
(e.g. following recommendations in [1]) and uniform past
(see Table 1), $\max_{0 \le j < t} \beta_{t+1}(j)$ goes to zero as a function
of t. Thus, the cost per shift (be it to continue an earlier cell
or to start a new one) grows without bound. On the other
hand, for fixed share $\beta_{t+1}(0) = \alpha$ for all t, hence fixed share
can start a new cell at fixed cost. It depends on the structured
expert whether continuing previously selected cells at
increasing cost is advantageous over relearning from scratch
after each shift at fixed cost. For EHMM experts with a finite
state space Q (including Bayes), relearning from scratch
will cost at most a factor $|Q|$ over learning on. This factor is
constant, so that fixed share will eventually win.

8 Conclusion 

We revisited Freund's problem, which asks for a strategy 
for prediction with expert advice that suffers low additional 
loss compared to Freund's reference scheme. We discussed 
the solution by Bousquet and Warmuth, which interprets the 
experts as black boxes. We proposed a new interpretation 
of Freund's scheme which is natural for learning experts, 
namely to train experts on the subsequence on which they are 
evaluated. This allows the reference scheme to exploit local 
patterns in the data, and thus makes the problem harder. 

We solved Freund's problem for structured experts that 
are represented as EHMMs, building on the work of Bousquet 
and Warmuth. We showed that our prediction strategies are 
efficient, and have desirable loss bounds that apply to all 
mixable losses. 

A Running Times 

We compare the running times on T outcomes of EPP and
the forward algorithm, with respect to an arbitrary EHMM
$\mathfrak{H}$ with a countable set of hidden states Q. For simplicity we
assume that the sets of experts $\Xi$ and outcomes X are finite.

Let $Q_t$ denote the hidden state of $\mathfrak{H}$ at time t, and let $p_\circ$,
$p_\to$ and $p_\downarrow$ denote $\mathfrak{H}$'s other components. Both algorithms
base their predictions on a distribution $\lambda_t$ on $Q_t$ at time t,
but differ in how they update $\lambda_t$ after observing $x_t$. As the
number of computations for this step depends on the size
of the support of $\lambda_t$ and on $p_\to$, we will need the following
concepts. For any probability distribution $p$ on Q, let
$\mathrm{Sp}(p) = \{q \in Q \mid p(q) > 0\}$ denote its support. We recursively
define $\mathcal{Q}_t$, the set of states reachable in exactly t steps,
and $\mathcal{Q}_{\le t}$, the set of states reachable in at most t steps, by

$$\mathcal{Q}_1 := \mathrm{Sp}(p_\circ), \qquad \mathcal{Q}_{t+1} := \bigcup_{q \in \mathcal{Q}_t} \mathrm{Sp}(p_\to(\cdot \mid q)), \qquad \mathcal{Q}_{\le t} := \bigcup_{i \le t} \mathcal{Q}_i.$$

Obviously, $\mathcal{Q}_t \subseteq \mathcal{Q}_{\le t} \subseteq Q$ holds for all t. Let
$g(S) := \sum_{q \in S} |\mathrm{Sp}(p_\to(\cdot \mid q))|$ be the number of outgoing transitions from
any set of states $S \subseteq Q$.
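The reachable sets and the transition count $g$ defined above are easy to compute for a concrete transition structure. The sketch below is illustrative only: the representation of $p_\circ$ as a dictionary and of $p_\to$ as a Python function, and the counter example on the natural numbers, are assumptions made here and do not come from the paper.

```python
def support(dist):
    """Sp(p): the states with positive probability."""
    return {q for q, p in dist.items() if p > 0}


def reachable_sets(p_init, p_trans, T):
    """Return (exact, at_most): states reachable in exactly / at most
    t steps for t = 1..T, stored at list index t - 1."""
    exact = [support(p_init)]
    for _ in range(T - 1):
        nxt = set()
        for q in exact[-1]:
            nxt |= support(p_trans(q))
        exact.append(nxt)
    at_most, seen = [], set()
    for states in exact:
        seen = seen | states
        at_most.append(seen)
    return exact, at_most


def g(states, p_trans):
    """Number of outgoing transitions from a set of states."""
    return sum(len(support(p_trans(q))) for q in states)


# Hypothetical infinite-state example: a counter that moves q -> q + 1.
def counter(q):
    return {q + 1: 1.0}


exact, at_most = reachable_sets({0: 1.0}, counter, T=4)
```

In this example the set reachable in exactly $t$ steps is a singleton while the set reachable in at most $t$ steps has $t$ elements, which is the kind of gap that separates the per-step cost of freezing from that of the forward algorithm when Q is infinite.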

A.1 Forward

The forward algorithm computes $\lambda_{t+1}$ by conditioning $\lambda_t$ on
$x_t$ and applying the transition function $p_\to$. As $\lambda_t$ has support
$\mathcal{Q}_t$, the forward algorithm requires $O(g(\mathcal{Q}_t))$ work per time
step, and $O(|\mathcal{Q}_t| + |\mathcal{Q}_{t+1}|)$ space. Notice that, for finite Q,
the number of transitions is bounded by $g(S) \le |Q|^2$ for any
S. A rough upper bound on the total running time of forward
on T outcomes is therefore $O(|Q|^2 T)$, which is linear in T.
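One forward step, as described above, can be sketched as follows. This is a minimal illustration with hypothetical data structures (dictionaries for $\lambda_t$, the production function and the transition function), and it assumes that the observed outcome has positive probability so that the normalisation is well defined.

```python
def forward_step(lam, p_prod, p_trans, expert_probs):
    """One step of the forward algorithm on an EHMM.

    lam:          dict state -> weight (the distribution lambda_t)
    p_prod[q]:    dict expert -> probability that state q produces it
    p_trans[q]:   dict next state -> transition probability
    expert_probs: dict expert -> probability the expert gave to x_t
    Returns (lambda_t(x_t), lambda_{t+1}).
    """
    # Condition lambda_t on the observed outcome x_t.
    joint = {q: w * sum(p * expert_probs[e] for e, p in p_prod[q].items())
             for q, w in lam.items()}
    pred_prob = sum(joint.values())
    posterior = {q: v / pred_prob for q, v in joint.items()}
    # Apply the transition function; the work is one multiplication
    # per outgoing transition of the current support.
    nxt = {}
    for q, w in posterior.items():
        for q2, p in p_trans[q].items():
            nxt[q2] = nxt.get(q2, 0.0) + w * p
    return pred_prob, nxt


# Hypothetical Bayesian EHMM on two experts: states coincide with
# experts and state evolution is the identity.
lam = {"a": 0.5, "b": 0.5}
p_prod = {"a": {"a": 1.0}, "b": {"b": 1.0}}
p_trans = {"a": {"a": 1.0}, "b": {"b": 1.0}}
prob, lam_next = forward_step(lam, p_prod, p_trans, {"a": 0.8, "b": 0.2})
```

With identity transitions the step reduces to an ordinary Bayesian posterior update over the two experts.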

A.2 EPP

The EPP algorithm comes in two variants: one for sleeping
and one for freezing. For sleeping the order of the running
time is determined by the evolution of past posteriors (line 8
in Algorithm 1); for freezing, which skips line 8, either computation
of $\lambda_t$ (line 3) or of the next posterior (line 7) is the
dominant step. The main difference for the running times of
the two variants, however, is that in sleeping $\pi_j$ has support
$\mathcal{Q}_t$ at any time t, whereas for freezing $\pi_j$ has support $\mathcal{Q}_{\le j}$.

A.2.1 Uniform Past 

For the uniform past mixing scheme, one can keep track of
the sum $\sum_{j} \pi_j(q_t)$ over all past posteriors to speed up computation of $\lambda_{t+1}$.

Sleeping This even works for sleeping, because applying
the state evolution to this sum in line 8 of the algorithm is
equivalent to applying it to the individual $\pi_j$ and then summing.
Consequently, sleeping requires $O(g(\mathcal{Q}_t))$ work and
$O(|\mathcal{Q}_t| + |\mathcal{Q}_{t+1}|)$ space per time step, which makes it as
efficient as the forward algorithm.
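The reason the running sum suffices for sleeping is that state evolution is linear. The sketch below illustrates this on a small example; the two-state transition table and the three past posteriors are made-up values, and the helper names are hypothetical.

```python
def evolve(weights, p_trans):
    """Apply the transition function to a (possibly unnormalised)
    weight vector over states."""
    out = {}
    for q, w in weights.items():
        for q2, p in p_trans[q].items():
            out[q2] = out.get(q2, 0.0) + w * p
    return out


def add(d1, d2):
    """Pointwise sum of two weight vectors."""
    out = dict(d1)
    for q, w in d2.items():
        out[q] = out.get(q, 0.0) + w
    return out


p_trans = {"a": {"a": 0.9, "b": 0.1}, "b": {"a": 0.2, "b": 0.8}}
posteriors = [{"a": 1.0}, {"a": 0.3, "b": 0.7}, {"b": 1.0}]

# Evolve the running sum once ...
total = {}
for pi in posteriors:
    total = add(total, pi)
evolved_sum = evolve(total, p_trans)

# ... or evolve every past posterior separately and sum afterwards.
sum_evolved = {}
for pi in posteriors:
    sum_evolved = add(sum_evolved, evolve(pi, p_trans))
```

Both routes give the same weights, but the first touches each transition once per time step instead of once per past posterior.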

Freezing For freezing, computing the next posterior (line 7)
determines the running time. It requires $O(g(\mathcal{Q}_{\le t}))$ work
and $O(|\mathcal{Q}_{\le t+1}|)$ space per time step. Depending on the
EHMM $\mathfrak{H}$, this may be significantly slower than the forward
algorithm. First, for finite Q, each of $\mathcal{Q}_t$, $\mathcal{Q}_{\le t}$ and Q has
size $O(1)$ in t, and freezing runs in time $O(T)$, just like the
forward algorithm. Second, for infinite Q, $\mathcal{Q}_{\le t}$ may be unbounded
as a function of t. Still, on T outcomes

$$\sum_{t \in 1:T} g(\mathcal{Q}_{\le t}) \;\le\; T\, g(\mathcal{Q}_{\le T}) \;\le\; T \sum_{t \in 1:T} g(\mathcal{Q}_t),$$

which implies that freezing is no more than a factor T slower
than the forward algorithm.

A.2.2 Decaying Past 

For the decaying past scheme the relative mixing weights
of any two past posteriors change from $\beta_t$ to $\beta_{t+1}$, which
prevents us from summing them as for uniform past. Implementing
decaying past therefore slows down both the evolution
of past posteriors and computation of $\lambda_t$ by a factor of
$O(t)$, and increases the required space by the same factor.
Fortunately, however, the decaying past scheme can be approximated
using a logarithmic number of uniform blocks,
as described in Appendix C of [1]. This reduces the slowdown
factor from $O(t)$ to $O(\log t)$.¹ Thus, both for sleeping
and for freezing, approximated decaying past is only a factor
$O(\log T)$ slower than uniform past on T outcomes, and requires
only a factor $O(\log T)$ more space.

B Loss Bounds 

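We identify $\lambda_t$ with the EHMM on $(Q_i, \Xi_i, X_i)_{i \ge t}$ with initial
distribution $\lambda_t$, and with the transition and production
functions of $\mathfrak{H}$. So in particular $\lambda_1 = \mathfrak{H}$. For convenience,
we shorten $(\lambda_t)^{\mathrm{fr}}_C(x_C)$ to $\lambda^{\mathrm{fr}}_t(x_C)$ and $(\lambda_t)^{\mathrm{sl}}_C(x_C)$ to $\lambda^{\mathrm{sl}}_t(x_C)$.
Thus, among others, $\lambda_t(x_t) = \lambda^{\mathrm{fr}}_t(x_t) = \lambda^{\mathrm{sl}}_t(x_t)$.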

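Lemma 2. For any $C \subseteq t{:}T$, interpreting $\lambda_0(\cdot \mid x_0)$ as $\lambda_1$,

$$\lambda^{f/s}_t(x_C) = \sum_{j \in 0:t-1} \beta_t(j)\, \lambda^{f/s}_j(x_C \mid x_j).$$

Proof. Let $\pi^t_j$ denote the past posterior $\pi_j$ at the beginning
of round t. Thus for freezing $\pi^t_j = \pi_j$, and for sleeping
$\pi^t_j$ is $\pi_j$ evolved $t - j$ steps. Then by definition
$\lambda_t(x_C) = \sum_{j=0}^{t-1} \beta_t(j)\, \pi^t_{j+1}(x_C)$. The operations $(\cdot)^{\mathrm{fr}}$ and $(\cdot)^{\mathrm{sl}}$ distribute
over taking mixtures. The lemma follows from the fact
that $(\pi^t_{j+1})^{\mathrm{fr}}(x_C) = \lambda^{\mathrm{fr}}_j(x_C \mid x_j)$ and $(\pi^t_{j+1})^{\mathrm{sl}}(x_C) = \lambda^{\mathrm{sl}}_j(x_C \mid x_j)$. □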



Proof of Theorem 3. For any t, we view the mixing scheme
$\beta_t$ as defining the distribution of a randomized choice $j_t \in 0{:}(t - 1)$
for the predecessor of the t-th outcome. Let $j_{>k} := j_{k+1:T} = (j_{k+1}, \ldots, j_T)$
denote a vector of the choices beyond
turn k. Unfortunately, some choices of $j_{>k}$ are inconsistent
with any partition, because an element can only have
one successor in a partition. Thus $j_{>k}$ is inconsistent with
any partition if $j_m = j_n > 0$ for $k < m \ne n \le T$. Let
the predicate $I(j_{>k})$ be true iff $j_{>k}$ is consistent with some
partition.

Some elements of $j_{>k}$ may indicate the start of a new cell
of the partition. Let $S(j_{>k})$ denote the set of times when $j_{>k}$
prescribes to start a new cell, i.e. $S(j_{>k}) := \{t \in k + 1{:}T \mid j_t = 0\}$.
For an example, consult Figure 7.

Figure 7 Notation example. $T = 10$, $k = 4$, $j_{>k} = (2, 0, 6, 7, 5, 9)$, $S(j_{>k}) = \{6\}$, $R_2(j_{>k}) = \{2, 5, 9, 10\}$.

¹ In [1] it is suggested to weight each block of posteriors
$\pi_{[j_1, j_2 - 1]}$ by $(j_2 - j_1)\beta_t(j_1)$. It seems that a marginal improvement
is possible by weighting by $\sum_{j_1 \le j < j_2} \beta_t(j)$ instead, which
can be implemented equally efficiently for decaying past.

Consistent values of $j_{>k}$ specify the last part of a partition.
For any $1 \le t \le k$, we may ask which of the times
$k + 1{:}T$ will be put in the same cell as t. Let $R_t(j_{>k})$ denote
this set, including t. For convenience, we abbreviate

$$\beta(j_{>k}) := \prod_{t \in k+1:T} \beta_t(j_t),$$

$$W(j_{>k}) := \prod_{i \in S(j_{>k})} \lambda^{f/s}_1\bigl(x_{R_i(j_{>k})}\bigr), \quad \text{and}$$

$$U_l(j_{>k}) := \prod_{i \in 1:l} \lambda^{f/s}_i\bigl(x_{R_i(j_{>k})}\bigr) \quad \text{for all } l \le k,$$

to name the intermediate debris arising from the incremental
reduction of $P^{f/s}_\beta(x_{1:T})$. $W$-terms deal with cells that are
completely specified by $j_{>k}$, while $U$-terms keep track of
the remaining partially specified cells. The proof proceeds
by downward induction on k, with induction hypothesis

$$\prod_{t \in 1:T} \lambda_t(x_t) \;\ge\; \sum_{j_{>k} :\, I(j_{>k})} \beta(j_{>k})\, W(j_{>k})\, U_k(j_{>k}). \qquad (10)$$

For the base case $k = T$ the hypothesis holds with equality,
and for $k = 0$ the hypothesis is equivalent to the desired
result (5). It remains to verify that it holds for $k - 1 \ge 0$ if it
holds for k. To this end, fix $k \ge 1$. To prove (10), it suffices
to show that for consistent $j_{>k}$

$$W(j_{>k})\, U_k(j_{>k}) \;\ge\; \sum_{j_k :\, I(j_{\ge k})} \beta_k(j_k)\, W(j_{\ge k})\, U_{k-1}(j_{\ge k}),$$

where $j_{\ge k}$ denotes $j_{k:T}$, i.e. $j_k$ followed by $j_{>k}$. We expand
the last factor of $U_k(j_{>k})$ using Lemma 2 and bound

$$U_k(j_{>k}) = \sum_{j_k \in 0:k-1} \beta_k(j_k)\, \lambda^{f/s}_{j_k}\bigl(x_{R_k(j_{>k})} \mid x_{j_k}\bigr)\, U_{k-1}(j_{>k})
\;\ge\; \sum_{j_k :\, I(j_{\ge k})} \beta_k(j_k)\, \lambda^{f/s}_{j_k}\bigl(x_{R_k(j_{>k})} \mid x_{j_k}\bigr)\, U_{k-1}(j_{>k}).$$

Observe that $R_t(j_{\ge k}) = R_t(j_{>k})$ for all $1 \le t < k$ except
$t = j_k$. There are two cases. If $j_k = 0$, then $U_{k-1}(j_{\ge k}) = U_{k-1}(j_{>k})$
and $W(j_{>k})\, \lambda^{f/s}_1(x_{R_k(j_{>k})}) = W(j_{\ge k})$; on the
other hand if $j_k > 0$ then $W(j_{>k}) = W(j_{\ge k})$. For consistent
$j_{\ge k}$, $U_{k-1}(j_{>k})$ contains the factor $\lambda^{f/s}_{j_k}(x_{j_k})$, which
implies that

$$\lambda^{f/s}_{j_k}\bigl(x_{R_k(j_{>k})} \mid x_{j_k}\bigr)\, U_{k-1}(j_{>k}) = U_{k-1}(j_{\ge k}).$$

□
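The bookkeeping used in this proof can be made concrete with a short sketch that recomputes the notation example of Figure 7. The function names are hypothetical, predecessor choices are passed as a tuple $(j_{k+1}, \ldots, j_T)$, and the code assumes $j_t < t$ for every entry, as required of a predecessor.

```python
def is_consistent(j_after):
    """I(j_{>k}): no positive time is chosen as predecessor twice
    (0 means 'start a new cell' and may repeat)."""
    preds = [j for j in j_after if j > 0]
    return len(preds) == len(set(preds))


def new_cell_starts(j_after, k):
    """S(j_{>k}): the times t in k+1:T with j_t = 0."""
    return {k + 1 + i for i, j in enumerate(j_after) if j == 0}


def same_cell(t, j_after, k):
    """R_t(j_{>k}): t together with the later times chained to it.
    Intended for consistent j_after only."""
    successor = {j: k + 1 + i for i, j in enumerate(j_after) if j > 0}
    cell = {t}
    while t in successor:
        t = successor[t]
        cell.add(t)
    return cell


# Figure 7: T = 10, k = 4, j_{>k} = (j_5, ..., j_10).
k = 4
j_after = (2, 0, 6, 7, 5, 9)
starts = new_cell_starts(j_after, k)
cell_of_2 = same_cell(2, j_after, k)
```

Applying `same_cell` to the start time 6 gives the cell that is completely specified by $j_{>k}$, which is the kind of cell collected in the $W$-terms.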



C Invariance 



Proof of Theorem 2. Let $\mu^1$ and $\mu^2$ be distributions on $Q^1$
and $Q^2$. We overload notation, and write $\mu^1$ and $\mu^2$ for the
EHMMs $\mathfrak{H}^1$ and $\mathfrak{H}^2$ with initial distribution replaced by $\mu^1$
and $\mu^2$. Recall that $\mu^1$ and $\mu^2$ are equivalent if $\mu^1(\xi_{1:T}) = \mu^2(\xi_{1:T})$
for all $\xi_{1:T}$. Thus, $\mathfrak{H}^1$ and $\mathfrak{H}^2$ are equivalent iff $p^1_\circ$
and $p^2_\circ$ are equivalent.

To prove the theorem, we need to prove that equivalence
is preserved by all the operations that EPP performs, i.e. taking
mixtures, performing loss update and performing state
evolution. Mixtures of equivalent distributions are equivalent,
since mixing and marginalisation commute. For loss update,
note that $p^{\xi_1}_1(x_1) = \mu^1(x_1 \mid \xi_{1:T}) = \mu^2(x_1 \mid \xi_{1:T})$ for all
$p^{\Xi}_1$ and all $\xi_{1:T}$. Finally, for state evolution, the claim follows
from $(p_\to \circ \mu)(\xi_{1:T}) = \mu(\Xi_{2:T+1} = \xi_{1:T})$. □
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